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^>| Abstract 

• We have developed a numerical software library for collisionless TV-body simulations named "Phantom-GRAPE" which highly 

'■"^ accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced 

'"[""'Vector extensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only the Newton's 

Q forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius r cut (i.e. f(r) = at r > r cut ), can be 

}—j quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. 

C/j We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces 

tt and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our 

library for both of the Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2 x 10 9 

^sj interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher 

J> than the performance of an implementation without any explicit use of SIMD instructions, and 2 times than that with the SSE 

f" — ■ instructions. With four processor cores, we obtain the performance of 8 x 10 9 interactions per second (or 300 GFLOPS). In the case 

C*") of the arbitrarily shaped central forces, we can calculate 1 x 10 9 and 4 x 10 9 interactions per second with one and four processor 

^~2 cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations 

r without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of 

£f) particles, irrespective of the force shape. It is good contrast with the fact that the performance of force calculations accelerated by 

(""*) graphics processing units (GPUs) depends strongly on the number of particles. Substantially weak dependence of the performance 

OQ 6n the number of particles is suitable to collisionless TV-body simulations, since these simulations are usually performed with 

t - ^ sophisticated TV-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that 

^ collisionless TV-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially 

on massively parallel environments. 
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1. Introduction 

Self-gravity is one of the most essential physical processes 
in the universe, and plays important roles in almost all cat- 
egories of astronomical objects such as globular clusters, 
galaxies, galaxy clusters, etc. In order to follow the evo- 
lution of such systems, gravitational TV-body solvers have 
been widely used in numerical astrophysics. 

Due to prohibitively expensive computational cost in 
directly solving TV-body problems, many efforts have been 
made to reduce it in various ways. For example, several 
sophisticated algorithms to compute gravitational forces 
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among many particles with reduced computational cost 
have been developed, such as Tree method (Barnes & 
Hut, 1986), PPPM method (Hockney & Eastwood, 1981), 
TreePM method (Xu, 1995), etc. 

Another approach is to improve the computational per- 
formance with the aid of additional hardware, such as 
GRAPE (GRAvity PipE) systems, special-purpose accel- 
erators for gravitational TV-body simulations (Sugimoto 
et al., 1990; Makino et al, 2003; Fukushige ct al., 2005), 
and general-purpose computing on Graphics Processing 
Units (GPGPUs). GRAPE systems have been used for 
further improvement of existing TV-body solvers such as 
Tree method (Makino, 1991), PPPM method (Brieu et 
al., 1995; Yoshikawa & Fukushige, 2005), TreePM method 
(Yoshikawa & Fukushige, 2005), P 2 M 2 tree method (Kawai 
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et al., 2004), and PPPT method (Oshino ct al., 2011). 
They have also adapted to simulation codes for dense stel- 
lar systems based on fourth-order Hcrmitc scheme, such as 
NBDDY4 (Johnson & Aarseth, 2006), NB0DY1 (Harfst ct al., 
2007), kira (Portegies Zwart ct al., 2008), and GORILLA 
(Tanikawa & Fukushigc, 2009). Recently, Hamada & 
Iitaka (2007), Portegies Zwart et al. (2007), Gaburov et al. 
(2009), and Bedorf et al. (2012) explored the capability of 
commodity graphics processing units (GPUs) as hardware 
accelerators for TV-body simulations and achieved similar 
to or even higher performance than the GRAPE-6A and 
GRAPE-DR board. 

A different approach to improve the performance of N- 
body calculations is to utilize Streaming SIMD Extensions 
(hereafter SSE), a SIMD (Single Instruction, Multiple 
Data) instruction set implemented on x86 and x86_64 pro- 
cessors. Nitadori et al. (2006) exploited the SSE and SSE2 
instruction sets, and achieved speeding up of the Hcrmite 
scheme (Makino & Aarseth, 1992) in mixed precision for 
collisional self-gravitating systems. Although unpublished 
in literature, Nitadori, Yoshikawa, & Makino have also 
developed a numerical library for A-body calculations in 
single-precision for collisionlcss self-gravitating systems in 
which two-body relaxation is not physically important and 
therefore single-precision floating-point arithmetic suffices 
for the required numerical accuracy. Furthermore, along 
this approach, they have also improved the performance in 
computing arbitrarily-shaped forces with a cutoff distance, 
defined by a user-specified function of inter-particle sepa- 
ration. Such capability to compute force shapes other than 
Newton's inverse-square gravity is necessary in PPPM, 
TreePM, and Ewald methods. It should be noted that 
GRAPE-5 and the later families of GRAPE systems have 
similar capability to compute the Newton's force multiplied 
by a user-specified cutoff function (Kawai et al., 2000), 
and can be used to accelerate PPPM and TreePM meth- 
ods for cosmological A-body simulations (Yoshikawa & 
Fukushigc, 2005). Based on these achievements, a publicly 
available software package to improve the performance of 
both collisional and collisionless A-body simulations has 
been developed, which was named "Phantom-GRAPE" 
after the conventional GRAPE system. A set of appli- 
cation programming interfaces of Phantom-GRAPE for 
collisionlcss simulations is compatible to that of GRAPE- 
5. Phantom-GRAPE is widely used in various numerical 
simulations for galaxy formation (Saitoh et al., 2008, 2009) 
and the cosmological large-scale structures (Ishiyama et 
al., 2008, 2009a,b, 2010, 2011). 

Recently, a new processor family with "Sandy Bridge" 
micro-architecture 1 by Intel Corporation and that with 
"Bulldozer" micro-architecture 2 by AMD Corporation 
have been released. Both of the processors support a new 
set of instructions known as Advanced Vector extensions 
(AVX), an enhanced version of the SSE instructions. In 
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the AVX instruction set, the width of the SIMD regis- 
ters is extended from 128-bit to 256-bit. We can perform 
SIMD operations on two times larger data than before. 
Therefore, the performance of a calculation with the AVX 
instructions should be two times higher than that with the 
SSE instructions if the execution unit is also extended to 
256-bit. 

Tanikawa et al. (2012) (hereafter, paper I) developed 
a software library for collisional A-body simulations us- 
ing the AVX instruction set in the mixed precision, and 
achieved a fairly high performance. In this paper, we 
present a similar library implemented with the AVX in- 
struction set but for collisionless A-body simulations in 
single-precision. 

The structure of this paper is as follows. In section 2, we 
overview the AVX instruction set. In section 3, we describe 
the implementation of Phantom-GRAPE. In section 4 and 
5, we show the accuracy and performance, respectively. In 
section 6, we summarize this paper. 



2. The AVX instruction set 

In this section, we present a brief review of the Ad- 
vanced Vector extensions (AVX) instruction set. Details of 
the difference between SSE and AVX is described in sec- 
tion 3.1 of paper I. AVX is a SIMD instruction set as well 
as SSE, and supports many operations, such as addition, 
subtraction, multiplication, division, square-root, approx- 
imate inverse-square- root, several bitwise operations, etc. 
In such operations, dedicated registers with 256-bit length 
called "YMM registers" are used to store the eight single- 
precision floating-point numbers or four double-precision 
floating-point numbers. Note that the lower 128-bit of the 
YMM registers have alias name "XMM registers" , and can 
be used as the dedicated registers for the SSE instructions 
for a backward compatibility. 

An important feature of AVX and SSE instruction sets 
is the fact that they have a special instruction for a very 
fast approximation of inverse-square-root with an accuracy 
of about 12-bit. Actually, this instruction is quite essential 
to improve the performance of the gravitational force cal- 
culations, since the most expensive part in the force cal- 
culation is an execution of inverse-square-root of squared 
distances of the particle pairs. As already discussed in Ni- 
tadori et al. (2006), the approximate values can be adopted 
as initial values of the Newton-Raphson iteration to im- 
prove the accuracy, and we can obtain 24-bit accuracy af- 
ter one Newton-Raphson iteration. For collisionless self- 
gravitating systems, however, the accuracy of ~ 12 bits is 
sufficient because the accuracy of inverse-square-root does 
not affect the resultant force accuracy if one adopts an ap- 
proximate A-body solver such as Tree, PPPM and TreePM 
methods. Therefore, we use the raw approximate instruc- 
tion throughout this study. 

Since the present-day compilers cannot always detect 
concurrency of the loops effectively, and cannot fully resolve 



the mutual dependency among data in the code, it is quite 
rare that compilers generate codes with SIMD instruc- 
tions in effective manners from codes expressed in high- 
level languages. For an efficient use of the AVX instructions, 
we need to program with assembly-languages explicitly or 
compiler-dependent intrinsic functions and data type ex- 
tensions. In assembly-languages, we can manually control 
the assignment of YMM registers to computational data, 
and minimize the access to the main memory by optimiz- 
ing the assignment of each register. In this work, we adopt 
an implementation of the AVX instructions using inline- 
assembly language with C expression operands, embedded 
in C-language, which is a part of language extensions of 
GCC (GNU Compiler Collection). 

3. Implementation 

Here, we describe the detailed implementation to accel- 
erate TV-body calculation using the AVX instructions. For a 
given set of positions r,i of N particles, we try to accelerate 
the calculations of a gravitational force given as follows: 
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where G is the gravitational constant, m j the mass of the 
j-th particle, and e the gravitational softening length. In 
addition to that, we also try to accelerate the computations 
of central forces among particles with an arbitrary force 
shape f{r) given by 
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where f(r) specifies the shape of the force as a function of 
inter-particle separation r with a cutoff distance r cut (i.e. 
f(r) = at r > r cut ). In the above expressions, parti- 
cles with subscript "j" exert forces on those with subscript 
"z" . In the rest of this paper, the former are referred to as 
"j-particles" , and the latter as "i-particles" just for conve- 
nience. 

Since individual forces exerted by j-particles on i- 
particles can be computed independently, we can calcu- 
late forces exerted by multiple j-particlcs on multiple 
i-particlcs in parallel. As described in the previous sec- 
tion, the AVX instructions can execute operations of eight 
single-precision floating-point numbers on YMM registers 
in parallel. By utilizing this feature of the AVX instruc- 
tions, the forces on four i-particles from two j-particles 
can be computed simultaneously in a SIMD manner. 

3.1. Structures for the particle data 

In computing the forces on four i-particles from two j- 
particles, we assign the data of i- and j-particles to YMM 
registers in the way shown in Figure 1. Suppose that a 
and b in Figure 1 are ^-components of i- and j-particles, 



respectively. Subtracting data in the YMM register (1) of 
Figure 1 from data in the YMM register (2) of Figure 1, 
wc simultaneously obtain x-componcnts of eight relative 
positions c in the YMM register (3) of Figure 1. 

In order to effectively realize such SIMD computations 
with the AVX instructions, we define the structures for i- 
particles, j-particles and the resulting forces and potentials 
shown in List 1. Before computing the forces on i-particles, 
the positions and softening lengths of i-particles are stored 
in the structure Ipdata, and the positions and masses of j- 
particles are in the structure Jpdata. The resulting forces 
are stored in the structure Fodata. Note that the structures 
Ipdata and Fodata contain the data of four i-particles, 
while the structure Jpdata has the data for a single j- 
particle. 

Note that the positions, softening lengths, and forces of i- 
particles in the structures Ipdata and Fodata are declared 
as arrays of four single-precision floating-point numbers. 
Thus, the data on each array can be suitably loaded onto, 
or stored from the lower 128-bit of one YMM register. The 
assignment of the i-particles data shown in (1) of Figure 1 
can be realized by loading the data of four i-particles onto 
the lower 128-bit of one YMM register, and copying the 
data to its upper 128-bit. 

As for j-particles, since the structure Jpdata consists of 
four single-precision floating-point numbers, wc can load 
the positions and the masses of two j-particles in one YMM- 
register at one time if they are aligned on the 32-byte 
boundaries. By broadcasting the n-th element (n = 0, 1, 2 
and 3) in each of the lower and upper 128-bit to all the other 
elements, we can realize the assignment of the j-particle 
data as depicted in (2) of Figure 1 . 

After executing the gravitational force loop over j- 
particles, the partial forces on four i-particles exerted by 
different sets of j-particles are stored in the upper and 
lower 128-bit of a YMM register. Operating sum reduction 
on the upper and lower 128-bit of the YMM register, and 
storing the results into its lower 128-bit, we can smoothly 
store the results into the structure Fodata. 

L ist 1. Structures for ^-particles, 7-particlcs, and the resulting forces. 



1 


// structure for i-particles 


2 


typedef struct ipdata{ 


3 


float x [4] ; 


4 


float y [4] ; 


5 


float z [4] ; 


6 


float eps2 [4] ; 


7 
8 
9 


> Ipdata, *plpdata; 


// structure for j-particles 


10 


typedef struct jpdata{ 


11 


float x, y, z , m; 


12 


} Jpdata, *pjpdata; 


13 




14 


// structure for the resulting fore 


15 


// and potentials of i-particles 


16 


typedef struct fodata{ 


17 


float ax [4] ; 


18 


float ay [4] ; 


19 


float az [4] ; 


20 


float phi [4] ; 


21 


} Fodata, *pFodata; 
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Fig. 1. Data assignments in YMM registers for SIMD calculations. The upper panel (1) shows the data assignment of four i-particles (with 
indices of ig, ti, 12, and 13), where the data are redundantly stored in the lower and upper 128-bit in the same order. The data of two 
j'-particles (with indices of jo and j\) are stored in the lower and upper 128-bit, respectively, as shown in the middle panel (2). The 8 values 
obtained by the operations between the data of four i-particles and two j'-particles are stored in the order shown in the lower panel (3): 
c 'j = f{ a itbj)- For example, Ci j is a result of operations between a,i Q and bj . 



3.2. Macros for inline assembly codes 

For the readability of the source codes shown below, let 
us introduce some preprocessor macros which are expanded 
into inline assembly codes. The definitions of the macros 
used in this paper are given in List 2. For macros with 
two and three operands, the results are stored in the sec- 
ond and third one, respectively, and the other operands are 
source operands. In these macros, operands named sre, 
srcl, src2, and dst designate the data in XMM or YMM 
registers, and those named mem, mem64, meml28, and mem256 
are data in the main memory or the cache memory, where 
numbers after mem indicate their size and alignment in bits. 
Brief descriptions of these macros are summarized in Ta- 
bic 1. More detailed explanation of the AVX instructions 
can be found in Intel's website 3 . 

List 2. Preprocessor macros for inline assembly codes. 



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

IT 

18 

1!) 

20 

21 

22 

23 

24 

25 

26 



af ine 
3f ine 
asm ( " 
3f ine 
asm ( " 
3f ine 
asm ( " 
3f ine 
asm ( " 
3f ine 
asm ( 
3f ine 
asm ( " 
3f ine 
asm ( " 
' ,u7." 
af ine 
asm ( " 
u : : 
af ine 
asm ( " 
' ,u%" 
3f ine 
asm ( " 
3f ine 
asm ( " 



VZEROALL asmC'vzeroall") ; 

VL0ADPS(mem256 , dst) \ 
vmovaps u 7.0 , u 7. " dst : : "m 

VSTORPSCreg, mem256) 
vmovaps u 7. " reg ", u 7.0" 

VLDADPSCmeml28 , dst) 
vmovaps u 7.0 , u 7. " dst : : "m 

VSTORPSCreg, meml28) 
vmovaps u 7. " reg " , u 7.0" 

VLDADLPSCmem64 , dst) 
viovlps u '/,0 , u 7. " dst " , u 7."dst 

VLDADHPS(mem64 , dst) \ 
vmovhps u 7.0 , u 7. " dst " , u 7."dst 

VBCASTL128(src , dst) \ 
vperm2f 128 u 7.0 , u '/,"src " , u 7. " 
dst " u " : : "g" (0x00) ) ; 

VC0PYU128T0L128(src ,dst) 



1 (mem256) ) ; 

\ 
: : "m" (mem256) ) ; 

\ 
1 (meml28) ) ; 

\ 
: : "m" (meml28) ) 1 

\ 



a" Cn 

a" Cn 

\ 



a64) ) ; 
a64) ) ; 



vextractf 128 u y.O , u 7."src ", u '/."dst \ 
M g"(0x01)) ; 

VGATHERL128(srcl ,src2 ,dst) \ 
vperm2f 128 u '/.0 , u '/."src2 ", u 7."srcl \ 
dst " u " : : "g" (0x02) ) ; 

VCDPYALLCsrc ,dst) \ 
vmovaps u 7,0, u 7,"src " , u °/ " dst ) ; 

VBCASTOCsrc, dst) \ 
vshuf ps u %0 , u % " sr c ", u 7o"src \ 



27 


" , u 7."dst " u " 


28 


#define VBCAST 


29 


asm ( " vshuf ps 


30 


" , u 7."dst " u " 


31 


#define VBCAST 


32 


asm ( " vshuf ps 


33 


" , u 7."dst " u " 


34 


#define VBCAST 


35 


asm ( "vshuf ps 


36 


" , u 7."dst " u " 


37 


#define VMIX0C 


38 


asm ( " vshuf ps 


39 


" , u 7. " d s t " u " 


40 


#define VMIXK 


41 


asm ( "vshuf ps 


42 


" , u 7."dst " u " 


43 


#define VADDPS 


44 


asm ( " vaddps u 


45 


#define VSUBPS 


46 


asm ( " vsubps u 


47 


#define VSUBPS 


48 


asm ( " vsubps u 


49 


" u " : : " m " Cmem 


50 


#define VMULPS 


51 


asm ( " vmulps u 


52 


#define VRSQRT 


53 


asm ( " vrsqrtp 


54 


#define VMINPS 


55 


asm ( " vminps u 


56 


#define VPSRLD 


57 


asm ( " vpsrld u 


58 


#define VPSLLD 


59 


asm ( " vpslld u 


60 


#define PREFET 


61 


asm ( "pref et c 



: : "g"(0x00)) ; 

1 ( sre , dst ) \ 

u%0 )U 7o"src ", u 7 "src \ 

: : "g"(0x55)) ; 

2(src, dst) \ 

u%0 )U 7o"src ", u 7 "src \ 

: : "g" (Oxaa) ) ; 

3(src, dst) \ 

u7 e O, u 7o"src ", u 7."src \ 

: : "g"(0xff)) ; 

srcl, src2, dst) \ 

u7.0 , u 7."src2 " )U 7."srcl \ 

: : M g"(0x88)) ; 

srcl ,src2, dst) \ 

u7.0 , u 7."src2 ", u 7."srcl \ 

: : "g"(0xdd)) ; 

Csrcl, src2 , dst) \ 

" srcl "," sre 2 "," dst); 

Csrcl, src2 , dst) \ 

" srcl "," sre 2 "," dst); 

_M(mem256, sre, dst) \ 

7.0, u 7."src " , u 7."dst \ 

256)) ; 

Csrcl, src2 , dst) \ 

" srcl "," sre 2 "," dst); 

PSCsrc, dst) \ 

s u " sre "," dst); 

Csrcl, src2 , dst) \ 

" srcl ", u " src2 "," dst); 

Cimm, srcl, src2) \ 

7.0 , u 7. "srcl " , u 7. "src2 : : " I" Cimm) ) ; 

Cimm, srcl, src2) \ 

7.0 , u 7." srcl " , u 7."src2 : : " I" Cimm) ) ; 

CHCmem) \ 

ht U 7.0 " : : "m" Cmem) ) ; 
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Furthermore, we define aliases of XMM and YMM reg- 
isters. Table 2 and 3 show the aliases of YMM registers in 
calculating Newton's force and an arbitrary shaped cen- 
tral force, respectively Aliases with suffix "_X" indicate the 
lower 128-bit of the original YMM register which can be 
used as XMM registers for the SSE instructions. Note that 
some of aliases are reused for data other than described in 
Table 2 and 3. 

3.3. Newton's force 

Figure 3 is a schematic illustration of a force loop to 
compute the Newton's force on four i-particles with AVX 



Table 1 

Descriptions of the macros for inline assembly codes. One 'value' denotes a single-precision floating-point number. 



VZEROALL 


zero out all the YMM registers. 


VLOADPS (mem256 , dst ) 


load eight packed values in mem256 to dst. 


VSTORPS (src ,mem256) 


store eight packed values in src to mem256. 


VLOADPS (meml28 , dst) 


load four packed values in meml28 to dst. 


VSTORPS (src ,meml28) 


store four packed values in src to meml28. 


VLO ADLPS (mem64 , dst ) 


load two packed values in mem64 to the lower 64- bit of the lower 128- bit in dst. 


VLO ADHPS (mem64 , dst ) 


load two packed values in mem64 to the upper 64-bit of the lower 128- bit in dst. 


VBCASTL128(src,dst) 


broadcast data in the lower 128-bit of src to the lower and upper 128-bit of dst. 


VC0PYU128T0L128(src,dst) 


copy the upper 128- bit in src to the lower 128- bit in dst. 


VGATHERL128(srcl,src2,dst) 


copy the lower 128-bit in srcl and src2 to the upper 128-bit and lower 128-bit in dst, respectively. 


VCOPYALL(src.dst) 


copy 256- bit data from src to dst. 


VBCASTn(src,dst) 


broadcast the n-th element of each of the lower and upper 128-bit to all the other elements. 


VMIX0(srcl,src2,dst) 


operate data as shown in Figure 2. 


VMIX1 (srcl, src2, dst) 


operate data as shown in Figure 2. 


VADDPS (srcl , src2 , dst) 


add srcl to src2, and store the result to dst. 


VSUBPS(srcl,src2,dst) 


subtract srcl from src2, and store the result to dst. 


VSUBPS_M(mem256, src, dst) 


subtract mem256 from src, and store the result to dst. 


VMULPS (srcl, src2, dst) 


multiply srcl by src2, and store the result to dst. 


VRSQRTPS(src,dst) 


compute the inverse-square-root of src, and store the result to dst. 


VMINPS (srcl, src2, dst) 


compare the values in each pair of elements in srcl and src2, and store the not larger ones to dst. 


VPSRLD ( imm , src , dst) 


shift each element in the lower 128-bit of src left by imm bit, and store the result to dst. 


VPSRRD ( imm , src , dst) 


shift each element in the lower 128-bit of src right by imm bits, and set the result to dst. 


PREFETCH (mem) 


prefetch data on mem to the cache memory. 
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Table 2 

Aliases of YMM registers for calculating Newton's forces in List 



srcl 

src2 

dst 

srcl 

src2 

dst 



Fig. 2. Instructions MIXO and MIX1. Each set of four boxes indicates 
the lower (or upper) 128-bit of a YMM register. Each box contains 
a single-precision floating-point number. 



Alias 


ID 


Description 


XI 
YI 
ZI 


XymmO 
7 ymml 
7 ymm2 


x, y, and 2-coordinates of i-particlcs 
(xi, yi, and z<) 


EPS2 


7ymm3 


square of the gravitational softening length (e ) 


AX 
AY 
AZ 


7 ymm4 
7 ymm5 
7 ymm6 


forces of i-particles 
(ax,i, a y ,i, and a Zji ) 


PHI 


7 ymm7 


gravitational potentials of i-particles ((f>i) 


XJ 
YJ 
ZJ 


7ymm8 
7 ymm9 
y.ymmlO 


x, y, and ^-coordinates of j-particles 
(xj, yj, and Zj) 


MJ 


7oymmll 


masses of j-particles (rrij) 


DX 
DY 
DZ 


7oymml2 
7.ymml3 
7,ymml4 


relative coordinates between i- and j'-particles 
\Xij , yij , and Zij ) 



instructions. In this figure, wc depict only the lower 128- 
bit of YMM registers just for simplicity, while, in actual 



computation, the upper 128-bit is used to compute forces 
on the same four i-particle exerted by another j'-particle. 
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Fig. 3. A schematic illustration of the force loop. Each set of four boxes indicates the lower 128-bit of a YMM register. Each box contains a 
single-precision floating-point number. Note that some of aliases are reused to store data other than described in Table 3. 



The overall procedures to calculate the force on four i- 
particles using AVX instructions are summarized as follows: 

0. Zero out all the YMM registers, and load the x, y, and 
z coordinates of four i-particles, and squared soften- 
ing lengths to the lower 128-bit of XI, YI, ZI, and 
EPS2 (i.e. XI_X, YI_X, ZI_X, and EPS2_X), and copy 
them to the upper 128-bit of XI, YI, ZI, and EPS2, 
respectively. 

1. Load the x, y, and z coordinates and the masses of 
two j-particles to XJ. 

2. Broadcast the x, y, and z coordinates and the masses 
of two j-particles in XJ to XJ, YJ, ZJ, and MJ, respec- 



tively. 

3. Subtract XI, YI, and ZI from XJ, YJ, and ZJ respec- 
tively. The results (xy, yij, and Zij) are stored in DX, 
DY, and DZ, respectively. 

4. Square Xij in DX, yij in DY, and Zy in DZ and sum 
them up to compute the squared distance between 
two j-particles and four z-particles. The results are 
stored in the alias YJ. The squared softening lengths 
EPS2 are also added. Eventually, the softened squared 
distances f 2 - = r 2 - + e 2 between two j-particles and 
four i-particles are stored in YJ. 

5. Calculate inverse-square-root for f 2 - in YJ, and store 



Table 3 

Aliases of YMM registers for calculating an arbitrary force shape in 

List 5, 



Alias 


ID 


Description 


X2 
Y2 
Z2 


7 yminO 
7,ymml 
7 ymm2 


squared inter-particle distances 


TWO 


7 ymin3 


constant value of 2 . in single- precision 


AX 
AY 
AZ 


7 ymm4 
7,ymin5 
7 ymm6 


forces of i-particles 


R2CUT 


7oymm7 


cutoff radius squared 


BUFO 
BUF1 
BUF2 


7oymm8 
7oymm9 
7.ymml0 


buffers used to refer a look-up table 


MJ 


7.ymmll 


masses of j-particles 


DX 
DY 
DZ 


7.ymml2 
7.ynunl3 
7.ymml4 


relative coordinates between i- and j-particlcs 


ZI 


7«ynunl5 


z-componcnts of positions of i-particles 



the result 1/ftj in the alias ZJ. 

6. Multiply 1/fij in ZJ by mj in MJ to obtain rrij/rij, 
and store the results in MJ. 

7. Accumulate rrij/fij in MJ into <pi in PHI. 

8. Square 1/fij in ZJ, multiply the result 1/f ? by rrij/fij 
in MJ, and store them rrij/rfj in YJ. 

9. Multiply Xij in DX, y^ in DY, and Zij in DZ by rrij/f^ 
in YJ obtaining the forces (mj^/ff-, mjj/jj/ff.-, and 
m-jZij/rfj), and accumulate them into AX, AY, and AZ, 
respectively. 

10. Return to step 1 until all the j-particles are processed. 

11. Operate sum reduction of partial forces and poten- 
tials in the lower and upper 128-bits of AX, AY, AZ, 
and PHI, and store the results in the lower 128-bit of 
AX, AY, AZ, and PHI, respectively. 

12. Store forces and potentials in the lower 128-bit of AX, 
AY, AZ, and PHI to the structure Fodata. 

The function GravityKernel to compute the forces is 
shown in List 3. The order of instructions in List 3 is slightly 
different from that described above in order to obtain high 
issue rate of the AVX instructions by optimizing the order 
of operations so that operands in adjacent instruction calls 
do not have dependencies as much as possible. Further op- 
timization is given by explicitly unrolling the force loop, 
which does not appear in the list. 

List 3. A force loop to calculate the Newton's force using the AVX 
instructions. 



6 
7 
8 
9 
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11 
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13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
20 
27 
28 
29 
30 
31 
32 
33 
84 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
40 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 
57 
58 
59 
60 
61 
62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 



PREFETCH (jpdat a [0] ) ; 

VZER0ALL; 

VL0ADPS(*ipdata->x, XI_X); 
VL0ADPS(*ipdata->y , YI_X); 
VL0ADPS(*ipdata->z, ZI_X); 
VL0ADPS(*ipdata->eps2 , EPS2_X) ; 
VBCASTL128CXI , XI) ; 
VBCASTL128CYI , YD ; 
VBCASTL128CZI , ZI) ; 
VBCASTL128CEPS2 , EPS2); 

VL0ADPS(*(jpdata) , XJ); 
j pdat a += 2 ; 

VBCASTKXJ, YJ); 

VBCAST2CXJ, ZJ); 

VBCAST3CXJ, MJ); 

VBCASTCKXJ, XJ); 

for(j = ; j < nj ; j += 2) { 
VSUBPSCYI, YJ, DY) 
VSUBPSCZI, ZJ, DZ) 
VSUBPSCXI, XJ, DX) 



VMULPSCDZ, DZ, ZJ) 
VMULPSCDX, DX, XJ) 
VMULPSCDY, DY, YJ) 



VADDPSCXJ , ZJ , ZJ) ; 
VADDPSCEPS2 , YJ , YJ) ; 
VADDPSCYJ , ZJ , YJ) ; 

VL0ADPS(*(jpdata) , XJ) 
j pdat a += 2 ; 

VRSQRTPSCYJ , ZJ) ; 

VMULPSCZJ , MJ , MJ) ; 
VMULPSCZJ , ZJ , YJ) ; 

VMULPSCMJ , YJ , YJ) ; 
VSUBPSCMJ , PHI , PHI) ; 



VMULPSCYJ, DX, DX) 
VMULPSCYJ, DY , DY) 
VMULPSCYJ, DZ , DZ) 



VBCASTKXJ, YJ) 

VBCAST2CXJ, ZJ) 

VBCAST3CXJ, MJ) 

VBCASTOCXJ, XJ) 



VADDPSCDX, AX, AX) 
VADDPSCDY, AY, AY) 
VADDPSCDZ, AZ , AZ) 



} 



VC0PYU128T0L128UX , DX_X) 
VADDPSCAX , DX , AX) ; 
VC0PYU128T0L128CAY, DY_X) 
VADDPSCAY , DY , AY) ; 
VC0PYU128T0L128UZ , DZ_X) 
VADDPSCAZ , DZ , AZ) ; 
VC0PYU128T0L128CPHI , MJ_X) 
VADDPSCPHI , MJ , PHI) ; 



VST0RPS(AX_X , *f odata->ax) ; 

VST0RPS(AY_X , *f odata->ay) ; 

VSTDRPS(AZ_X , *f odata->az) ; 

VSTDRPS(PHI_X , *f odata->phi) ; 



void GravityKernel (plpdata ipdata, 


pFodata fodata, 


pjpdata jpdata , 


int n j ) 


{ 




int j ; 





3.4. Central force with an arbitrary shape 

In this section, we describe how to accelerate the compu- 
tation of arbitrarily shaped forces f(r)/r, using the AVX 
instructions, where f(r) is a user-specified function in equa- 
tion (2). Note that the inter-particle softening is also ex- 
pressed in the force shape function f(r), as well as the 
long range cut-off. Arbitrary shaped softening including the 
Plummer softening, 52 softening, etc. can be set. The func- 
tion f(r) is assumed to shape; almost constant at r < e, 
rapidly decreases at larger r, and reaches zero at r = r cut . 
Such assumptions are satisfied in the inter-particle force 
calculations of PPPM or TreePM methods. 

In order to calculate central forces with an arbitrary 
shape in equation (2), we refer to a pre-calculatcd look-up 
table of f(r)/r and use the linear interpolation between 
the sampling points. In § 3.4.1 and 3.4.2, we describe our 
scheme to construct the look-up table, and procedure to 
calculate the force by using the look-up table with the AVX 
instructions, respectively. 



is affinc-transformed to a single-precision floating-point 



number s = r (s r 
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x 2)/r; 
c , where s 



2 
cut 



2 so that s is in the range 



= 2 and 



- o2 



2 2 (2 



\/2 F ). Here, E and F are the pre-defined positive integers, 
and the numbers of exponent and fraction bits extracted 
in computing the indices of the look-up table, respectively. 
Binary expressions of s m ; n and s max in the IEEE754 format 
of single-precision (32-bit) in the case of E = 4 and F = 6 
are shown in Table 4. Except that the most significant bit 
of the exponent part is always 1, all the bits of s m j n are 0, 
and as for s max , only the lower E bits of the exponent and 
the higher F bits of the fraction are 1. Next, the indices 
of the look-up table for the squared distances r 2 or s are 
computed by extracting the lower E bits of the exponent 
and the higher F bits of the fraction of s (underlined portion 
of exponent and fraction bits in Table 4) and reinterpreting 
it as an integer. This procedure can be done by applying a 
logical right shift by 23 — F bits, and a bitwise-logical AND 
with 2 E+F — 1 to s. It should be noted that the resulting 
size of the look-up table is 2 E+F . 



3.4.1. Construction of an optimized look-up table 

In terms of numerical accuracy, the look-up table is pre- 
ferred to have a large number of sampling points between 
< r < r cut . On the other hand, the size of the look-up 
table should be as small as possible to avoid cache misses 
for fast calculations. Thus, it is important how to choose 
sampling points of the look-up table in order to satisfy such 
exclusive requirements: accuracy and fast calculations of 
forces. 

In many previous implementations, sampling points of 
the look-up table are chosen so that the sampling points 
have equal intervals in a squared inter-particle distance r 2 
at < r < r cut . However, the sampling with equal intervals 
in r 2 is not a good choice, because it has coarser intervals 
at a smaller inter-particle distance, and the force shape at 
r < e is poorly sampled if the number of sampling points 
is not large enough, while the shape at r ~ r cut is sam- 
pled fairly well, or even redundantly (see the top panel of 
Figure 6). Typically speaking, tens of thousand sampling 
points in the region < r < r cut are required to assure the 
sufficient force accuracy if sampling with equal intervals in 
r 2 is adopted. Such look-up tables need several hundred 
kilobytes in single-precision, and do not fit into a low-level 
cache memory. 

The desirable sampling of the force shapes, therefore, 
should have almost equal intervals in r at short distances 
r < e, and intervals proportional to r (or equal intervals in 
lnr) at long distances. In the following, we realize such a 
sampling by adopting rather a new binning scheme, with 
which we can compute the force efficiently. 

Here, we consider to construct a look-up table of f(r)/r in 
the range of < r < r cut ■ In our binning scheme, the indices 
of the look-up table are calculated by directly extracting 
the fraction and the exponent bits of the IEEE754 format of 
squared inter-particle distances. First, the squared distance 



Table 4 

s- values, their exponent and fraction bits in the IEEE754 expressions, 
and their indices in the table for r = 0, r cu t/2 and r cu t in the case of 
E = A and F = 6 (underlined portion of exponent and fraction bits) . 



r 


s 


exponent bits 


fraction bits 


index 





2 t,Smin/ 


10000000 


00000000000000000000000 











r C ut/2 


3.2514 x 10 4 


10001101 


11111100000001100000000 


895 








r C ut 


1.3005 x 10 5 

(Smax ) 


10001111 


1 1 1 1 1 1 00000000000000000 


1023 









An affinc-transformed squared distance at a sampling 
point with an index specified by a lower E exponent bits 
6e and an upper F fraction bits 6p is expressed as 



s bE ,b E = 2 &E+1 [ 



2 F 



(0 < 6 E < 2 E , < b F < 2 1 



(3) 

The ratio between inter-particle distances whose affine- 
transformed squared distances are S(i> B +i),i> F and Sf, E ,& F is 
given by 



r {b E + l),b F 
r b E ,b F 



>(b E + l),b F 



-2 



££>E,frF 



1/2 



2 1 / 2 , 



(4) 



where 6e 3> 1 is assumed for the last approximation. 
The interval between inter-particle distances whose affinc- 
transformed distances are s& E ,(&p+i) and Sb E: b F is calculated 
as 



r b E ,(bF + l) ~ r b E .b E 



Sb E: (b F + l) - 2 



1/2 



/2 6e+1 /2 f+2 



{2 F + b F yn v 



-2 



1/2 



1/2 



(5) 



where we also assume &e ~^> 1 and F ^S> 1 for the last 
approximation. Therefore, the sampling points with the 
same fraction bits are distributed uniformly in logarithmic 
scale, and those with the same exponent bits are aligned 
uniformly in linear scale unless the fraction bit is small. 

As an example, we illustrate how the sampling points of 
the look-up table depend on the pre-defined integers E and 
F in Figure 4. We first see the cases in which either of E and 
F is zero, in order to see the roles of the integers E and F. As 
seen in Figure 4, the intervals of sampling points are roughly 
uniform in linear scale for the case E = (the bottom line 
in the top panel), and uniform in logarithmic scale for the 
case F = (the middle line in the bottom panel), unless 
r/r cut is small. As expected above, the integers E and F 
control the number of sampling points in logarithmic and 
linear scales, respectively. 

By comparing the sampling points with (E, F) = (4, 0) 
and those with (4,2) (see the top panel of Figure 4), it 
can be seen that all intervals of the sampling points with 
(E, F) = (4, 0) (indicated by the vertical dashed lines and 
double-headed arrows) are divided nearly equally into 2 F = 
4 regions by the sampling points with (E,F) = (4,2). Thus, 
our binning scheme is a hybrid of the linear and logarithmic 
binning schemes. 

Figure 5 shows the comparison of the several binning in 
which the number of sampling points is fixed to 2 E+F = 
2 6 . One can see that the binning with (E, F) = (4, 2) has 
sufficient sampling points in the range of 10 -3 < r/r cut < 
10°, whereas the binning with the other sets of (E, F) only 
samples the region of 10~ 2 < r/r cut < 10°. The number 
of the extracted exponent bit E should be large enough so 
that the scale of the softening length should be sufficiently 
resolved. For example, if e/r cut < 10 -2 , E should be set to 
at least equal to or larger than 4. 

In List 4, we present routines for constructing the look- 
up table. In our implementation, the look-up table contains 
two values: one is the force at a sampling point r^, 



Gl = 



f(rk) 



(6) 



and the other is its difference from the next sampling 
point rfc+i divided by the interval of the affine-transformed 
squared distance 



Gl 



u k+i 



G° k 



Sk+i - Sk 



(7) 



where subscript k indicates indices of the look-up table, 
and is expressed as k = 2 F x 6g + bp. Using these two 
values, we can compute the linear interpolation of f(r)/r 
at a radius r with r^ < r < rk+i by G® + (s — Sk)G\. 
The G° and G\ , are stored in a two-dimensional array de- 
clared as Force_table [TBL_SIZE] [2] , where TBL_SIZE is 
the number of the sampling points {2 E+F ) and the values 
of the G° and G\ are stored in the Force_table [k] [0] 
and Force_table [k] [1] , respectively. Since the values of 
G? and Gl are stored in the adjacent memory address, we 
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Fig. 4. Sampling points of an intcr-particlc distance for a look-up 
tabic in various cases of pre-defined integers E and F. The top and 
bottom panels take horizontal axes in linear and logarithmic scales, 
respectively. 

can avoid the cache misses in computing the linearly inter- 
polated values of f(r)/r. 





List 4. Implementation of the construction of the look-up table. 


1 


#define EXP_BIT (4) 


2 


#define FRC_BIT (6) 


3 

4 
5 
6 
7 


#define TBL_SIZE (1 << (EXP_BIT+FRC_BIT) ) // 1024 


extern float For ce_t able [TBL_SIZE] [2] ; // 8 kB 


union pack32{ 


8 


float f; 


9 


uns inged int u ; 


10 


}; 


11 




12 


void generat e_f orce_t able Cf loat rcut) 


13 


{ 


14 


unsigned int tick; 


15 


float f max , r2scale, r2max; 


16 


union pack32 m32 ; 


17 




18 


float force_func (float ) ; 
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Fig. 5. Comparison of binning among the same number of sampling 
points in various cases of the integers E and F. 
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tick = (1 << (23-FRC_BIT)) ; 

flax = (1 << (1<<EXP_BIT))*(2.0-1 . 0/ ( 1< <FRC_BIT) ) ; 

r2max = rcut*rcut; 

r2scale = (f max-2 . Of ) /r2max ; 

for(i=0,m32.f=2.0f;i<TBL_SIZE;i++,m32.u+=tick) { 
float f, r2 , r; 

f=m32. f ; 

r2 = (f-2 . 0)/r2scale ; 
float r = sqrtf (r2) ; 
Force_table[i][0] = force_funcCr); 
} 

for(i=0,m32.f=2.0f; i<TBL_SIZE - 1 ; i + + ) { 

float xO = m32.f; 

m32 . u += tick; 

float xl = m32.f; 

float yO = Force_table [i] [0] ; 

float yl = (i==TBL_SIZE-l) ? 0.0 

: Force_table [1 + 1] [0] ; 

Force_table [i] [1] = (yl -yO) / (xl -xO) ; 
} 
Force_table [i] [1] = . Of ; 



In Figure 6, we compare the conventional binning with 
equal intervals in squared distances to our binning with 
E = 4 and F = 2 (i.e. 64 sampling points), for the S'2-forcc 
shape (Hockney & Eastwood, 1981) used in the PPPM 
scheme. Although we adopt F = 5 in the rest of this pa- 
per, we set F = 2 here just for good visibility of the differ- 
ence of the two binning schemes. It should be noted that 
the number of sampling points is the same (64) in both 
schemes. Compared with the conventional binning scheme 
in the top panel, our binning scheme can faithfully repro- 
duce the given functional form even at distances smaller 
than the gravitational softening length. 

3.4.2. Procedure of force calculation 

In calculating the arbitrary central forces, the data of 
i- and j-particles are stored in the structures Ipdata and 




r/r cut 

Fig. 6. Binning of f(r)/r in the conventional scheme with 64 constant 
intervals in r 2 (top panel) and in our scheme with E = 4 and F = 2 
(bottom panel) between [0, r C ut]. Although we adopt F = 5 elsewhere 
in this paper, we set F = 2 here for viewability. R(r, e) — R(r, r cu t) 
is assumed as a functional form of f(r), in which R(r,rj) is the 
S'2-profile (Hockney & Eastwood, 1981) (see equation (16)). Solid 
lines indicate the shape of f(r)/r. Vertical dashed lines in both panels 
are the locations of the gravitational softening length e. 

Jpdata, respectively, in the same manner as described in 
the case for calculating the Newton's force, except that the 
coordinates of i- and j-particles are scaled as 



^CUt/V^ 



(8) 



so that we can quickly compute the affmc-transformed 
squared inter-particle distances between i- and j-particlcs. 
As in the case of the Newton's force, we compute the forces 
of four i-particles exerted by two j-particles using the AVX 
instructions. Using the scaled positions of the particles, 
the calculation of the forces is performed in the force loop 
as follows; 

(i) Calculate an afhne-transformed distance between i- 
and j-particles, s, as 



iin(|fj 



r,\ 2 



2,Sr 



(9) 



(ii) 



(iii) 



(iv) 



(v) 



where the function "min" returns the minimum value 
among arguments. 

Derive an index k of the look-up table from the affinc- 
transformed squared distance, s, computed in the pre- 
vious step by applying a bitwise-logical right shift by 
23 — F bits and reinterpreting the result as an integer. 
Refer to the look-up table to obtain G° and G\ . Note 
that the address of the pointer to f cut is decremented 
by l«(30-(23-F) ) in advance (see line 24 in List 5) 
to correct the effect of the most significant exponen- 
tial bit of s. 

Derive an affinc-transformed distance s& that corre- 
sponds to the fc-th sampling point ru by applying a 
bitwise- logical left shift by 23— F bits to k and reinter- 
preting the result as a single-precision floating-point 
number. 

Compute the value of f{\rj — ri\)/\rj — r»| by the 
linear interpolation of G® and G° +1 . Using the values 
of G° and G\, the interpolation can be performed as 



f(\ r j - r i 



= Gl + Gl(s- s k ) . 



(vi) Accumulate scaled "forces" on i-particles as 



A' 



&i = J2 '' 



f(\rj-ri\),z 



(fj - h) 



(10) 



(ii) 



After the force loop, the scaled "forces" are rescaled back 
as 

a, = - CUt _ «»■ (12) 

The actual code of the force loop for the calculation of the 
central force with an arbitrary force shape is shown in List 5. 
Note that bitwise-logical shift instructions such as VPSRLD 
and VPSLLD can be operated only to XMM registers or the 
lower 128-bit of YMM registers. In order to operate bitwise- 
logical shift instructions to data in the upper 128-bit of a 
YMM register, we have to copy the data to the lower 128-bit 
of another YMM register. Bitwise-logical shift operations 
to the upper 128-bit of YMM registers are supposed to be 
implemented in the future AVX2 instruction set. Also note 
that we cannot refer to the look-up table in a SIMD manner 
and have to issue the VLOADLPS and VLDADHPS instructions 
one by one (see lines 89-92 and 94-97 in List 5). Except for 
those operations, all the other calculations are performed 
in a SIMD manner using the AVX instructions. 

List 5. Implementation of arbitrary force calculation using AVX 
instructions. 
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#def ine 


FRC.BIT (6) 


#def ine 


ALIGN32 ..attribute.. ((aligned(32))) 


#def ine 


ALIGN64 ..attribute.. C(aligned(64))) 


typedef 


float v4sf ..attribute.. C(vector_size(16))); 


typedef 


struct ipdata_reg{ 


float 


x[8] ; 


float 


y[8] ; 


} Ipdata_reg, *plpdata_reg; 


void GravityKernel(pIpdata ipdata, 




p Jpdat a jp , 
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90 
91 
92 



pFodata fodata, 

int nj , 

float f cut [] [2] , 

v4sf r2cut , v4sf accscale) 

int j ; 

unsigned long int ALIGN64 idx[8] 

= {0, 0, 0, 0, 0, 0, 0, 0}; 

Ipdata_reg ALIGN32 ipdata_reg; 

static v4sf two = {2. Of, 2. Of, 2. Of, 2. Of}; 

fcut -= (K<(30-(23-FRC_BIT) )) ; 

VZER0ALL; 

VL0ADPS(ipdata->x[0] , X2_X); 
VL0ADPS(ipdata->y [0] , Y2_X); 
VL0ADPS(ipdata->2 [0] , Z2_X); 
VL0ADPS(r2cut , R2CUT_X); 
VLOADPSCtwo, TWD_X); 
VBCASTL128CX2 , X2) ; 
VSTDRPSCX2, ipdata_re 
VBCASTL128CY2 , Y2) ; 
VSTDRPSCY2, ipdata.re 
VBCASTL128CZ2 , ZI) ; 
VBCASTL128CR2CUT , R2CUT) ; 
VBCASTL128CTW0 , TWO); 



VLOADPSOjp , MJ) 
JP += 2; 

VBCAST0CMJ, X2) 
VBCASTKMJ, Y2) 
VBCAST2CMJ, Z2) 



VSUBPS_M(*ipdata_reg.x, X2 , DX); 
VMULPSCDX , DX , X2) ; 
VADDPSCTW0, X2 , X2); 

VSUBPS_M(*ipdata_reg.y , Y2 , DY); 
VMULPSCDY , DY , Y2) ; 
VADDPSCX2 , Y2 , Y2) ; 



x[0]) ; 
y[0]) ; 



VSUBPSCZI, Z2, DZ) 

VMULPSCDZ, DZ, Z2) 

VADDPSCY2, Z2, Y2) 

VBCAST3CMJ, MJ); 

VMULPSCMJ, DX, DX) 

VMULPSCMJ, DY , DY) 

VMULPSCMJ, DZ, DZ) 



VMINPSCR2CUT , Y2 , Z2); 



f or(j =0; j < nj ; 
VLOADPSOjp, MJ) 
JP += 2; 



j += 2){ 



VC0PYU128T0L128CZ2 , Y2_X); 
VPSRLD(23-FRC_BIT , Y2_X , Y2_X) 
VPSRLD(23-FRC_BIT , Z2_X , X2_X); 

VST0RPS(X2_X , idx[0]); 
VST0RPS(Y2_X , idx[4]); 



VPSLLD(23-FRC_BIT . 
VPSLLD(23-FRC_BIT . 



Y2_X , Y2.X) 
X2_X, X2_X) ; 



VGATHERL128CY2 , X2 , Y2); 
VSUBPSCY2 , Z2 , Z2) ; 

VBCAST0CMJ, X2); 
VBCASTKMJ, Y2); 

VSUBPS_M(*ipdata_reg.x, X2 , X2); 
VSUBPS_M(*ipdata_reg.y , Y2 , X2); 

VL0ADLPS(*f cut [idx [4]] , BUF0_X) 

VL0ADHPS(*f cut [idx [5] ] , BUF0_X) 

VLOADLPSOf cut [idx [0] ] , BUF1_X) 

VL0ADHPS(*f cut [idx[l]] , BUF1_X) 
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VGATHERL128CBUF0 , BUF1 , BUF1); 




VL0ADLPS(*f cut [idx [6]] , BUF2.X) 




VL0ADHPS(*f cut [idx [7] ] , BUF2_X) 




VL0ADLPS(*f cut [idx [2] ] , BUFO.X) 




VLOADHPS(*f cut [idx [3]] , BUF0_X) 




VGATHERL128CBUF2 , BUFO , BUF2); 


VMIXKBUF1, BUF2 , BUFO); 


VMIX0CBUF1, BUF2 , BUF2); 


VMULPSCZ2, BUFO, BUFO); 


VBCAST2CMJ, Z2); 


VBCAST3CMJ, MJ); 


VSUBPSCZI , Z2 , Z2) ; 


VADDPSCBUFO, BUF2 , BUF2); 


VMULPSCBUF2 , DX , DX) ; 


VMULPSCBUF2 , DY , DY) ; 


VMULPSCBUF2 , DZ , DZ) ; 


VADDPSCDX, AX, AX) ; 


VADDPSCDY, AY, AY) ; 


VADDPSCDZ , AZ , AZ) ; 


VC0PYALLCX2 , DX) ; 


VC0PYALLCY2 , DY) ; 


VC0PYALLCZ2 , DZ) ; 


VMULPSCX2 , X2 , X2) ; 


VMULPSCY2 , Y2 , Y2) ; 


VMULPSCZ2 , Z2 , Z2) ; 


VADDPSCTWO, X2 , X2); 


VADDPSCZ2 , Y2 , Y2) ; 


VADDPSCX2 , Y2 , Y2) ; 


VMULPSCMJ , DX, DX) ; 


VMULPSCMJ , DY, DY) ; 


VMULPSCMJ , DZ , DZ) ; 


VMINPSCR2CUT , Y2 , Z2); 
} 
VC0PYU128T0L128UX , X2_X); 


VADDPSCAX, X2 , AX) ; 


VC0PYU128T0L128CAY, Y2_X); 


VADDPSCAY, Y2 , AY) ; 


VC0PYU128T0L128CAZ , Z2_X); 


VADDPSCAZ , Z2 , AZ) ; 


VMULPS_M(accscale , AX_X , AX_X); 


VMULPS_M(accscale , AY_X , AY_X); 


VMULPS_M(accscale , AZ_X , AZ_X); 


VSTDRPS(AX_X , *f odata->ax) ; 


VSTORPS(AY_X , *fodata->ay) ; 


VSTORPS(AZ_X , *fodata->az) ; 
> 



List 6. Code fragment to parallelize the calculations using OpenMP 
p rogramming interface, 



Although the AVX instruction set takes the non-destructive 
3-operand form, the copy instruction between registers ap- 
peared in the code above, which was intended to avoid the 
inter-register dependencies. 

3.5. Parallelization on multi-core processors 

On multi-core processors, we can parallelize the calcu- 
lations of the forces of i-particles for both of the New- 
ton's force and arbitrary central forces using the OpenMP 
programming interface by assigning a different set of four 
i-particles onto each processor core. List 6 shows a code 
fragment for the parallelization of the computations of the 
Newton's force. The calculation of an arbitrary force can 
be parallelized similarly to that of Newton's force. 
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#define ISIMD 4 

extern Ipdata ipos [NI.MEMMAX / ISIMD]; 

extern Jpdata j pos [N J_MEMMAX] ; 

extern Fodata iacc [NI.MEMMAX / ISIMD]; 

int nig = ni / ISIMD + (ni 7. ISIMD ? 1 : 0) 

#pragma omp parallel for 
for(i = 0; i < nig; i++) 

GravityKernel(&ipos[i], &iacc[i], jpos, nj) 



3.6. Application programming interfaces 

With the implementations of the force calculation ac- 
celerated with the AVX instructions described above, 
we develop a set of application programming interfaces 
(APIs) for TV-body simulations, which is compatible to 
GRAPE-5 library 4 , except that our library do not sup- 
port functions to search for neighbours of a given particle. 
The APIs arc shown in List 7. g5_set_xmj sends the 
data of j-particlcs to the array of the structure Jpdata. 
g5_calculate_f orce_on_x sends the data of i-particlcs 
to the array of the structure Ipdata, and computes the 
forces and potentials of i-particles and returns them into 
the arrays ai and pi, respectively. 

In the function g5_open, we derive statistical bias of 
the fast approximation of inverse-square-root, VRSQRTPS in- 
struction. As Nitadori et al. (2006) reported, the results of 
this instruction contains a bias which is implementation- 
dependent. We statistically correct this bias in the same 
way as Nitadori et al. (2006). 

Softening length and the number of j-particles are set by 
the functions g5_set_eps_to_all and g5_set_n, respec- 
tively. g5_close does nothing and is just for compatibility 
with the GRAPE-5 library. 

List 8 shows a code fragment to perform an iV-body sim- 
ulation, using this APIs. 
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List 7. APIs. 




void 


g5_ 


.open (void) ; 




void 


KB. 


.close (void) ; 




void 


RB. 


.set _eps_to_all (double eps) ; 




void 


KB. 


.set _n ( int n j ) ; 




void 


gB. 


_set_xmj(int adr, 
int nj , 

double (*xj ) [3] , 
double *m j ) ; 




void 


KB. 


_calculate_force_on_x (double 


(*xi) [3] , 






double 


(*ai) [3] , 






double 


*pi , 






int ni) 


; 







List 8. Sample code. 


int n ; 




// The number of particles 


double 


m[NMAX] ; 


// Mass 


double 


x[NMAX] [3] 


; // Position 


double 


v[NMAX] [3] 


; // Velocity 


double 


a[NMAX] [3] 


; // Force 



http : //www . kf cr . jp/downloads/g7pkg2 . 2 . l/g5user . pdf 



6 


double p[NMAX]; // Potential 




7 


double t; // Time 




8 


double tend; // Time at the finish 


t ime 


9 


double dt ; // Timestep 




10 


void t ime_int egrat or ( int , 




11 


double (*) [3] , 




12 


double (*) [3] , 




13 


double (*) [3] 




14 


double) ; 




15 


// Function for time 


int egrat ion 


16 






17 


g5_open() ; 




18 


g5_set _eps_to_all Ceps) ; 




19 


g5_set_n(n) ; 




20 


whileCt < tendH 




21 


g5_set_xmj (0 ,n,x,m) ; 




22 


g5_calculate_force_on_x(x,a,p,n); 




23 


time_integrator(n,x,v,a,dt) ; 




24 


t += dt ; 




25 


} 




26 


g5_close() ; 





For the version of arbitrary force shape, we provide a new 
API call to set the force-table through a function pointer, 
which is not compatible to the GRAPE-5 API. 



and itself, — m,/e to finally obtain the correct potential of 
the i-th particle. Note that the potential between the i-th 
particle and itself is largest among the potentials between 
the i-th particle and all the particles, since the separation 
between i-particlc and itself is zero. Thus, the subtraction 
of the "potential" due to the self-interaction causes the 
cancellation of the significant digits, and consequently de- 
grades the accuracy of the potentials. 

A remedy for such degradation of the accuracy is to avoid 
the self-interaction in the force loop. In fact, we do so in 
calculating the potentials in double-precision (</>dp) in Fig- 
ure 7. However, such treatment requires conditional bifur- 
cation inside the force loop, and significantly reduces the 
computational performance. The potentials of particles are 
usually necessary only for checking the total energy conser- 
vation, and the accuracy obtained in our implementation 
is sufficient for that purpose. For these reasons, we choose 
the original way to compute the potentials of particles in 
our implementation. 



4. Accuracy 

4.1. Newton's force 

We investigate accuracy of forces and potentials obtained 
by our implementation for Newton's force. For this pur- 
pose, we compute the forces and potentials of particles in 
the Plummer models using our implementations and com- 
pare them with those computed fully in double-precision 
floating-point numbers without any explicit use of the AVX 
instructions. For the calculations of the forces and the po- 
tentials, we adopt the direct particle-particle method and 
the softening length of 4r v /A, where r v is a virial radius of 
the Plummer model and N is the number of particles. 

Figure 7 shows the cumulative distribution of relative 
errors in the forces and the potentials of particles, 



I^avx — a DP | 



odp 



and 



10. 



AVX 



^DPl 



!>DP 



(13) 



(14) 



where oiavx an d <^avx arc the force and the potential cal- 
culated using our implementation, and odp and </>dp are 
those computed fully in double-precision. It can be seen 
that most of the particles have errors less than 10 -4 . These 
errors primarily come from the approximate invcrse-square- 
root instruction VRSQRTPS, whose accuracy is about 12-bit, 
and consistent with the typical errors of ~ 1CP 4 . 

While the errors of the forces are distributed down to less 
than 10~ 7 , the errors of the potentials are mostly larger 
than ~ 3 x 10 -5 . It can be ascribed to the way of exclud- 
ing the contribution of self- interaction to the potentials. In 
computing a potential of the i-th particle, we accumulate 
the contribution from particle pairs between the i-th parti- 
cle and all the particles including itself, and then subtract 
the contribution of the potential between the i-th particle 



4.2. Central force with an arbitrary shape 

In order to see accuracies of central forces with an ar- 
bitrary shape obtained in our implementation, we choose 
a force shape which is frequently adopted in cosmologi- 
cal A-body simulations using PPPM or TreePM methods. 
Such methods are comprised of the particle-mesh (PM) 
and the particle-particle (PP) parts which compute long- 
and short-range components of inter-particle forces, respec- 
tively. Our implementation of the calculation of arbitrarily- 
shaped central forces can accelerate the calculation of the 
PP part, in which the force shape is different from the New- 
ton's force and is expressed as 



f(r) = R(r,e)-R(r,r cut ), 



(15) 



where R(r, a) is the so-called S'2-profile with a softening 
length of a (Hockney & Eastwood, 1981) given by 



R(r, a) 



(224£ - 224£ 3 + 70£ 4 

for (0 < £ < 1) 
(l2/£ 2 - 224 + 896£ 



48£ 5 -21£ 6 )/35c 



840£ 2 + 224£ 3 + 70£ 4 



-48£ 5 + 7£ 6 )/35a 3 for (1 < £ < 2) 

I for (2 < 

(16) 
We calculate forces exerted between 4K particle pairs 
with various separations uniformly distributed in ln(r) in 
the range of 5 x 10~ 3 < r/V cut < 1 using our implemen- 
tation described in section 3.4, where IK is equal to 1024. 
We set e and r cut to 3.125 x 10~ 3 and 4.6875 x 1CT 2 , and 
masses to unity. In creating the look-up table of the force 
shape, we set E = 4 and F = 5. 

Figure 8 shows a functional form of R(r, e) (solid curve) 
and f(r) (dashed curve) in the top panel and relative errors 
of forces including both PP and PM parts, i.e. R(r,e), in 



c: 
o 

H — ' 

3 
JO 
'&_ 

H — ' 

w 

TJ 

CD 
> 

-t—> 

E 
O 



0.5 



1 ' " 

■N=1K 


N "i ' ■ i ' M 


ft 

/ i 

J i 

' i 


■N=4K 


v^ + 1 ' " 

/ ' 

/ / 
/ / 
/ / 


-N=16K 


i 


i ■ — 


forces/ 


I 

! 

' potentials- 


forces/ 


i 
i 

r 
i 
I 

I 

j potentials- 

I 

I 
.'■ 1 1 . .. 


forces/ 


I 

I 
I 

;' potentials- 

) 

1 

l 

..'. i i . .. 



10"' 10" 



10" 5 10" 4 10" 3 
Error 



10" 7 10" 6 10" 5 10" 4 10" 3 
Error 



10" 7 10" 6 10" 5 10" 4 10" 3 
Error 



Fig. 7. Cumulative distribution of errors in forces (solid curves) and potentials (dashed curves) of particles in Plummer models with N = IK, 
4K, and 16K, where IK is equal to 1024. Softening lengths are set to 4r v /7V. The errors are calculated as |o,avx — a Dp|/l a Dp| an d 
<^AVX — ^dpI/I^dpIi where a is force, <j> potential. The subscripts of "AVX" and "DP" indicate the forces and potentials obtained as our 
implementation in section 3.3, and those obtained by performing all the calculations in double-precision, respectively. We deal self-interactions 
as described in the text. 



the bottom panel as a function of r/r cut . In Figure 8, we 
can see that the relative errors are less than 10 -3 , which are 
sufficiently accurate for cosmological TV-body simulations. 

5. Performance 

In this section, we present the performance of our imple- 
mentation of the collisionless TV-body simulation using the 
AVX instructions (hereafter AVX-accelerated implemen- 
tation). For the measurement of the performance, we use 
an Intel Core i7-2600 processor with 8MB cache memory 
and a frequency of 3.40 GHz, which contains four proces- 
sor cores. In measuring the performance, Intel Turbo Boost 
Technology is disabled, and Intel Hyper-Threading Tech- 
nology (HTT) is enabled. A compiler which we adopt is GCC 
4.4.5, with options -03 -f fast-math -funroll-loops. 
To see the advantage of the AVX instructions relative to 
the SSE instructions, we also develop the implementations 
with the SSE instructions rather than the AVX instruc- 
tions both for Newton's force and arbitrarily-shaped force 
( S SE- accelerated implementation ) . 

5.1. Newton's force 

First, we show the performance of our implementation 
for Newton's force. The performance is measured by exe- 
cuting the direct particle-particle calculation of the Plum- 
mer model with the number of particles from 0.5K to 32K. 
The left panel of Figure 9 depicts the performances of 
the AVX- and SSE-accelerated implementations. For com- 
parison, we also show the performance of an implementa- 
tion without any explicit use of SIMD instructions (labeled 
as "w/o SIMD" in the left panel of Figure 9). The num- 
bers of interactions per second arc 2 x 10 9 in the case of 
the AVX-accelerated implementation with a single thread, 



which corresponds to 75 GFLOPS, where the number of 
floating-point operations for the computation of force and 
potential for one pair of particles is counted to be 38. The 
performances of the SSE- and AVX-accelerated implemen- 
tations with a single thread are higher than those with- 
out SIMD instructions by 10 and 20 times, respectively, 
and higher than those expected from the degree of concur- 
rency of the SSE and AVX instructions for single-precision 
floating-point number (4 and 8, respectively). This is be- 
cause a very fast instruction of approximate invcrsc-square- 
root is not used in the "w/o SIMD" implementation. On 
the other hand, the performance with the AVX-accelerated 
implementation is higher than that of the SSE-accelerated 
implementation roughly by a factor of two as expected. 

Furthermore, in the left panel of Figure 9, we show the 
performance of a GPU-accelerated A-body code based 
on the direct particle-particle method implemented using 
the CUDA language, where the GPU board is NVIDIA 
GeForce GTX 580 connected through the PCI-Express 
Gen2 xl6 link. The GPU-accelerated A-body code com- 
putes the forces and potentials of the particles using GPUs, 
and integrate the equations of the motion of the particles 
on a CPU. Thus, the communication of the particle data 
between the main memory of the host machine and the 
device memory on the GPU boards is required, and can 
hamper the total efficiency of the code. Of course, if all the 
calculations are performed on GPUs, we might not suffer 
from such overhead. However, the performance of such 
implementation cannot be fairly compared with those of 
the AVX- and SSE-accelerated implementations, because 
the communication of the particle data is inevitable when 
we perform A-body simulations with multiple GPUs or 
with multiple nodes equipped with GPUs, regardless of 
the A-body solvers such as Tree and TreePM methods. 

The performances of the AVX- and SSE-accelerated im- 
plementations arc almost independent of the total number 
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Fig. 8. Shape of R(r, e) and /(r) (upper panel) and the relative errors 
of forces of particle pairs with a separation r (bottom panel) as a 
function of r/r cu t, where the forces include both PP and PM parts. 
Here, -Ravx an d Rp>P are ; respectively, an absolute force calculated 
with our implementation and that obtained by performing all the 
calculations in double-precision without referring to the look-up ta- 
ble. The separations of particle pairs are distributed uniformly in 
ln(r) in the range of 5 X 10 -3 < r/r cu t < 1- 

of particles, N. On the other hand, the performance of the 
GPU-accelerated implementation strongly depends on the 
number of particles N, due to the non-negligible overhead 
caused by the particle data communication. For N = 0.5K, 
the performance of the GPU-accelerated implementation 
is only 5% of that for N = 32K. Thus, for small N (0.5K 
and IK), the performance of the AVX-accelerated imple- 
mentation with four threads is higher than that with GPU- 
accelerated implementation, although, for large N (4K- 
32K), the performance of the GPU-accelerated implemen- 
tation is higher than that of the AVX-accelerated imple- 
mentation. These features can be explained by the commu- 
nication overhead in the GPU-accelerated implementation. 
So far, we see the performance of our code in the case 
that both the numbers of i- and j-particles (N and Nj, 
respectively) are the same and equal to N. However, in ac- 
tual computations of forces in collisionless iV-body simula- 



tions based on various A-body solvers such as PPPM, Tree, 
and TrcePM methods, the numbers of i- and j-particlcs 
Ni and Nj are much smaller than the total number of 
particles N. In the Tree method modified for the effective 
force with external hardwares or softwares as described in 
Makino (1991), for example, JVj is the number of particles, 
for which a tree traverse is performed simultaneously and 
the resultant interaction list (size Nj) is shared, and typ- 
ically around 10-1000. Furthermore, if one adopts the in- 
dividual timestep algorithm, the number of i-particles N 
gets even smaller. The number of j-particles Nj is also de- 
creased in Tree and TrcePM methods. Therefore, we show 
the performance for typical N and Nj in the realistic sit- 
uations of typical collisionless iV-body simulations. 

The right panel of Figure 9 shows the performance of the 
AVX-accelerated implementation using four threads with 
four processor cores (black lines) and that of the GPU- 
accelerated one (red lines) for various set of A^ and Nj . It 
can be seen that the obtained performance gets lower for 
the smaller Ni and Nj, regardless of the implementations. 
For the AVX- and SSE-accelerated implementations, this 
feature is due to the overhead of storing the particle data 
into the structures Ipdata and Jpdata shown in List 1. 
The amount of the overhead of storing i- and j-particles are 
proportional to N and Nj , respectively, and the computa- 
tional cost is proportional to NNj. Keeping this in mind 
the low performance with Ni = 16 compared with those 
with N > 64 can be ascribed to the overhead of storing j- 
particles to the structure Jpdata. For the GPU-accelerated 
implementation, the overhead originates from the transfer 
of the particle data to the memory on GPUs. It can be seen 
that the performance of the AVX-accelerated implementa- 
tion has rather mild dependence on A, and Nj , while that of 
the GPU-accelerated one relatively strongly depends on Nj . 
Such difference reflects the fact that the bandwidths and 
latency of the communication between GPUs and CPUs 
are rather poor compared with those of memory access be- 
tween CPUs and main memory. Thus, the performance of 
the GPU-accelerated implementation is apparently supe- 
rior to the AVX-accelerated one only when both of Ni and 
Nj are sufficiently large (say, JV» > IK and Nj > 4K). 

At the end of this section, we apply our AVX-accelerated 
implementation to Barnes-Hut Tree method (Barnes & 
Hut, 1986), and measure its performance. Our tree code 
is based on the PP part of TreePM code implemented 
by Yoshikawa & Fukushigc (2005) and Fukushige et al. 
(2005), in which they accelerated the calculations of the 
gravitational forces of the 52-profilc using GRAPE-5 and 
GRAPE-6A systems under the periodic boundary condi- 
tion. We modify the tree code such that it can compute 
the Newton's force under the vacuum boundary condition. 
Since both of GRAPE-6A systems and Phantom-GRAPE 
library support the same APIs, we can easily utilize the 
capability of Phantom-GRAPE by simply exchanging the 
software library. 

Using the tree code described above, we calculate gravi- 
tational forces and potentials of all the particles in a Plum- 
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Fig. 9. (Left) Performance of the AVX- and SSE-accelerated implementation for Newton's force as a function of the number of particles 
(solid lines). The numbers in parentheses refer to the number of threads adopted. Dashed curve shows the performance of a GPU (NVIDIA 
GeForce GTX 580) attached to a host computer equipped with an Intel Core i7-2600K processor. (Right) Performance of the calculations of 
forces and potentials for a various set of the number of i- and j'-particles, Ni and Nj, respectively. Black and red curves show the performance 
with the AVX-accelerated implementation on four threads and that of GPU-accelerated implementation. 



mcr model and a King model with the dimcnsionlcss cen- 
tral potential depth W = 9. We measure the performance 
on an Intel Core i7-2600 processor. For the comparison 
with other codes, we also measure the performance of the 
same code but without any explicit use of SIMD instruc- 
tions, and the publicly available code bonsai (Bedorf et 
al., 2012), which is a GPU-accelerated ./V-body code based 
on the tree method. The performance of the bonsai code 
is measured on a system with NVIDIA GeForce GTX 580. 
Since the bonsai code utilizes the quadrupole moments of 
the particle distribution in each tree node as well as the 
monopolc moments in the force calculations, for a fair com- 
parison of the performance with the bonsai code, we give 
our tree code a capability to use the quadrupole moments 
in each tree node, although the original code uses only 
the monopolc moments. We represent these multipolc mo- 
ments as pseudo-particles, using pseudo-particle multipolc 
method (Kawai & Makino, 2001). Figure 10 shows the wall 
clock time to compute gravitational forces and potentials 
for each tree code. We show the both results with the code 
which uses the quadrupole moments (lower panels) and the 
one which uses only the monopole moments (upper pan- 
els). Note that the wall clock time includes the time for 
tree construction, tree traverse and calculations of forces 
and potentials but we exclude the time to integrate orbital 
motion of particles. As expected, the wall clock time with 
the AVX-accelerated implementation is roughly 10 times 
shorter than those without any explicit use of SIMD in- 
structions, owing to parallelism to calculate forces and po- 
tentials. The wall clock time with the AVX-accelerated im- 
plementation is about only three times longer than those 
with bonsai, despite that theoretical peak performance of 



Intel Core i7-2600 (220 GFLOPS) is lower than that of 
NVIDIA GeForce GTX 580 (1600 GFLOPS) by a factor of 
7.3 in single- precision. We expect that the performance of 
our AVX-accelerated implementation could be close to that 
of the bonsai in the following situations. When we adopt 
individual timestep algorithm, the number of i-particles is 
effectively decreased, and a part of GPU cores becomes in- 
active. Thus, the performance of GPU-accelerated imple- 
mentation would be degraded more rapidly than that of 
our AVX-accelerated implementation. Furthermore, when 
we use GPU-accelerated implementation on massively par- 
allel environments, the communication between CPUs and 
GPUs is inevitable, which also degrades the performance 
of GPU-accelerated implementation. 



5.2. Force with an arbitrary shape 

The left panel of Figure 1 1 shows the performance of our 
implementation to calculate forces with an arbitrary force 
shape accelerated with the AVX and SSE instructions. For 
the comparison, we also plot the performance of an imple- 
mentation without any explicit use of the SIMD instruc- 
tions. The numbers of exponent and fraction bits used to 
referring the look-up table are set to E = 4 and F = 6, 
respectively. The performance of the AVX-accelerated im- 
plementation with a single thread is 2 and 6 times higher 
than that of the SSE-accelerated one and the one without 
any SIMD instructions, respectively. These forces with the 
use of the AVX instructions are lower than those expected 
from the degree of concurrency of their SIMD operations, 8, 
mainly because the reference of a look-up table is not car- 
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ried out in a SIMD manner. The performance with multi- 
thread parallelization is almost proportional to the number 
of threads up to four threads. If the HTT is activated, the 
performance with eight threads is higher than that with 
four threads by a few percent. 

The right panel of Figure 11 shows the performance of 
the AVX-accelerated implementation with eight threads for 
a various set of JVj and Nj. For iV, > 64, the performance 
is almost independent of Ni and Nj, and for N = 16 it is 
about half the performance with Ni > 64. This is again due 
to the overhead of copying j-particle data to the structure 
Jpdata, as is the case in the calculation of Newton's force. 
Such weak dependence of the performance on Ni and Nj 
are also preferable for the calculations of the forces in the 
PPPM and TrcePM methods especially with the individual 
timestep scheme. 



6. Summary 

Using the AVX instructions, the new SIMD instructions 
of x86 processors, we develop a numerical library to ac- 
celerate the calculations of Newton's forces and arbitrar- 
ily shaped forces for TV-body simulations. We implement 
the library by means of inline-assembly embedded in C- 
language with GCC extensions, which enables us to manu- 
ally control the assignment of the YMM registers to com- 
putational data, and extract the full capability of a CPU 
core. In computing arbitrarily shaped forces, we refer to a 
look-up table, which is constructed with a novel scheme so 
that the binning is optimized to ensure good numerical ac- 
curacy of the computed forces while its size is kept small 
enough to avoid cache misses. 

The performance of the version for Newton's forces 
reaches 2 x 10 9 interactions per second with a single thread, 
which is about 2 times and 20 times higher than those of 
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the implementation with the SSE instructions and with- 
out any explicit use of SIMD instructions, respectively. 
The use of the fast inverse-square-root instruction is a key 
ingredient of the improvement of the performance in the 
implementation with the SSE and AVX instructions. The 
performance of the version for arbitrarily shaped forces is 
2 and 6 times higher than those implemented with the SSE 
instructions and without any explicit use of the SIMD in- 
structions. Furthermore, our implementation supports the 
thread parallelization on a multi-core processor with the 
OpenMP programming interface, and has a good scalability 
regardless of the number of particles. 

While the performance of our implementation using the 
AVX instructions is moderate compared with that of the 
GPU-accelerated implementation, the most remarkable ad- 
vantage of our implementation is the fact that the perfor- 
mance has much weaker dependence on the numbers of i- 
and j-particles than that of the GPU-accelerated imple- 
mentation. This feature is also the case for the calculation 
of the arbitrarily shaped force, and can be explained by 
the relatively large overhead of the data transfer between 
GPUs and main memory of their host computers. In actual 
calculations of forces with popular TV-body solvers such as 
the Tree-method and the TreePM-method combined with 
the individual timestep scheme, the numbers of i- and j- 
particles cannot be always large enough to extract the full 
capability of GPUs. In that sense, our implementation is 
more suitable in accelerating the calculations of forces us- 
ing the Tree- and TreePM-mcthods. 

Another advantage of our implementation is its portabil- 
ity. With this library, we can carry out collisionless A-body 
simulations with a good performance even on supercom- 



puter systems without any GPU-based accelerators. Note 
that massively parallel systems with GPU-based accelera- 
tors, at least currently, are not ubiquitous. Even on proces- 
sors other than the x86 architecture, most of them supports 
similar SIMD instruction sets (e.g. Vector Multimedia Ex- 
tension on IBM Power series, and HPC-ACE on SPARC64 
VHIfx, etc.) Our library can be ported to these processors 
with some acceptable efforts. 

Finally let us mention the possible future improvement of 
our implementation. Fused Multiply- Add (FMA) instruc- 
tions which have already been implemented in the "Bull- 
dozer" CPU family by AMD Corporation, and is sched- 
uled to be introduced in the "Haswell" processor by Intel 
Corporation in 2013. The use of the FMA instructions will 
improve the performance and accuracy of the calculations 
of forces to some extent. The numerical library "Phantom- 
GRAPE" developed in this work is publicly available at 
http : //code . google . com/p/phantom-grape/. 
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